Goal!! Event detection in sports video
Abstract
Understanding complex events from unstructured video, like scoring a goal in a football game, is an extremely challenging task due to the dynamics, complexity and variation of video sequences. In this work, we attack this problem by exploiting the capabilities of the recently developed framework of deep learning. We consider independently encoding spatial and temporal information via convolutional neural networks and fusing the features via regularized Autoencoders. To demonstrate the capabilities of the proposed scheme, a new dataset is compiled, composed of goal and no-goal sequences. Experimental results demonstrate that extremely high classification accuracy can be achieved, from a dramatically limited number of examples, by leveraging pretrained models with fine-tuned fusion of spatio-temporal features.

Introduction

Analyzing unstructured video streams is a challenging task for multiple reasons [10]. A first challenge is associated with the complexity of real-world dynamics that are manifested in such video streams, including changes in viewpoint, illumination and quality. In addition, while annotated image datasets are prevalent, a smaller number of labeled datasets are available for video analytics. Last, the analysis of massive, high-dimensional video streams is extremely demanding, requiring significantly higher computational resources compared to still imagery [11].

In this work, we focus on the analysis of a particular type of video showing multi-person sport activities, and more specifically football (soccer) games. Sport videos in general are acquired from different vantage points, and the decision to select a single stream for broadcasting is taken by the director. As a result, the broadcast video stream is characterized by varying acquisition conditions, like zooming in near the goalpost during a goal and zooming out to cover the full field. In this complex situation, we consider the high-level objective of detecting specific and semantically meaningful events, like an opponent team scoring a goal. Succeeding in this task will allow the automatic transcription of games, video summarization and automatic statistical analysis.

Despite the many challenges associated with video analytics, the human brain is able to extract meaning and provide contextual information in a limited amount of time and from a limited set of training examples. From a computational perspective, the process of event detection in a video sequence amounts to two fundamental steps, namely (i) spatio-temporal feature extraction and (ii) example classification. Typically, feature extraction approaches rely on highly engineered handcrafted features like SIFT, which however are not able to generalize to more challenging cases. To achieve this objective, we consider the state-of-the-art framework of deep learning [18] and more specifically the case of Convolutional Neural Networks (CNNs) [16], which has taken by storm almost all problems related to computer vision, ranging from image classification [15, 16], to object detection [17], and multi-modal learning [6]. At the same time, the concept of Autoencoders, a type of neural network which tries to approximate the input at the output via regularization with various constraints, is also attracting attention due to its learning capacity in cases of unsupervised learning [21].
While significant effort has been applied in designing and evaluating deep learning architectures for image analysis, leading to highly optimized architectures, the problem of video analysis is at the forefront of research, where multiple avenues are being explored. The urgent need for video analytics is driven both by the wealth of unstructured videos available online and by the complexities associated with adding the temporal dimension. In this work, we consider the problem of goal detection in broadcast, low-quality football videos. The problem is formulated as a binary classification of short video sequences which are encoded through a spatio-temporal deep feature learning network. The key novelties of this work are to:

• Develop a novel dataset for event detection in sports video and, more specifically, for goal detection in football games;
• Investigate deep learning architectures, such as CNNs and Autoencoders, for achieving efficient event detection;
• Demonstrate that learning, and thus accurate event detection, can be achieved by leveraging information from a few labeled examples, exploiting pre-trained models.

State-of-the-art

For video analytics, two major lines of research have been proposed, namely frame-based and motion-based, where in the former case features are extracted from individual frames, while in the latter case additional information regarding the inter-frame motion, like optical flow [3], is also introduced. In terms of single-frame spatial feature extraction, CNNs have had a profound impact in image recognition, scene classification, and object detection, among others [16]. To account for the dynamic nature of video, a recently proposed concept involves extending the two-dimensional convolution to three dimensions, leading to 3D CNNs, where temporal information is included as a distinct input [12, 13]. An alternative approach for encoding the temporal information is through the use of Long Short-Term Memory (LSTM) networks [1, 13], while another concept involves the generation of dynamic images through the collapse of multiple video frames and the use of 2D deep feature extraction on such representations [7]. In [2], temporal information is encoded through average pooling of frame-based descriptors and their subsequent encoding in Fisher and VLAD vectors. In [4], the authors investigated deep video representations for action recognition, where temporal information was introduced in the frame-diff layer of the deep network architecture, through different temporal pooling strategies applied at patch level, frame level, and temporal-window level. One of the most successful frameworks for encoding both spatial and temporal information is the two-stream CNN [8]. Two-stream networks consider two sources of information, raw frames and optical flow, which are independently encoded by a CNN and fused into an SVM classifier.

[Figure 1: Block diagram of the proposed goal detection framework. A 20-frame moving window initially selects the part of the sequence of interest, and the selected frames undergo motion estimation. Raw pixel values and optical flows are first independently encoded using the pre-trained deep CNN to extract spatial and temporal features. The extracted features are either introduced into a higher-level network for fusion, which is fine-tuned for the classification problem, or concatenated and used as extended input features for the classification.]
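To make the two-stream paradigm concrete, the following is a minimal sketch of such a pipeline applied to a 20-frame window, along the lines of Figure 1. The use of PyTorch/torchvision for the pretrained VGG-16 backbone, OpenCV's Farneback optical flow, scikit-learn's SVM, the flow-to-image encoding and all function names are illustrative assumptions rather than the authors' implementation; in particular, plain concatenation is used here in place of the Autoencoder fusion described later.

```python
import cv2
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.svm import SVC

# Pretrained VGG-16; drop the final classification layer to expose 4096-d descriptors.
vgg = models.vgg16(weights="IMAGENET1K_V1")
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])
vgg.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def spatial_feature(frame_bgr):
    """Appearance descriptor of a single raw frame (spatial stream)."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        return vgg(preprocess(rgb).unsqueeze(0)).squeeze(0).numpy()

def temporal_feature(prev_bgr, next_bgr):
    """Motion descriptor of a dense optical-flow field (temporal stream).
    The two flow channels plus the flow magnitude are stacked into a 3-channel
    image so the same pretrained 2D network can encode them -- an assumption,
    not necessarily the encoding used by the authors."""
    prev = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    nxt = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=2, keepdims=True)
    flow_img = np.concatenate([flow, mag], axis=2)
    flow_img = cv2.normalize(flow_img, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    with torch.no_grad():
        return vgg(preprocess(flow_img).unsqueeze(0)).squeeze(0).numpy()

def window_descriptor(frames):
    """Fused descriptor of a 20-frame window: averaged spatial and temporal
    features, concatenated (simple concatenation instead of Autoencoder fusion)."""
    spatial = np.mean([spatial_feature(f) for f in frames], axis=0)
    temporal = np.mean([temporal_feature(a, b) for a, b in zip(frames, frames[1:])], axis=0)
    return np.concatenate([spatial, temporal])

# Goal / no-goal classification with a kernel SVM on the fused descriptors,
# given X_windows (a list of 20-frame windows as BGR numpy arrays) and labels y:
# clf = SVC(kernel="rbf").fit([window_descriptor(w) for w in X_windows], y)
```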
Further studies on this framework demonstrated that using pre-trained models can have a dramatic impact on training time for the spatial and temporal features [22], while convolutional two-stream network fusion was recently applied in video action recognition [23]. The combination of 3D convolutions and the two-stream approach was also recently reported for video classification, achieving state-of-the-art performance at significantly lower processing times [24]. The performance demonstrated by the two-stream approach for video analysis led to the choice of this paradigm in this work.

Event Detection Network

The proposed temporal event detection network is modeled as a two-stream deep network, coupled with a sparsity regularized Autoencoder for the fusion of spatial and temporal data. We investigate Convolutional and Autoencoder Neural Networks for the extraction of spatial, temporal and fused spatio-temporal features, and the subsequent application of kernel-based Support Vector Machines for the binary detection of goal events. A high-level overview of the processing pipeline is shown in Figure 1.

While in fully-connected networks each hidden activation is computed by multiplying the entire input by the corresponding weights in that layer, in CNNs each hidden activation is computed by multiplying a small local input against the weights. The typical structure of a CNN consists of a number of convolution and pooling/subsampling layers, optionally followed by fully connected layers. At each convolution layer, the outputs of the previous layer are convolved with learnable kernels and passed through the activation function to form this layer's output feature map. Let $n \times n$ be a square region extracted from a training input image $X \in \mathbb{R}^{N \times M}$, and let $w$ be a filter of kernel size $m \times m$. The output of the convolutional layer $h \in \mathbb{R}^{(n-m+1) \times (n-m+1)}$ is given by:

$$h_{ij} = \sigma\!\left( \sum_{a=0}^{m-1} \sum_{b=0}^{m-1} w_{ab}\, x_{(i+a)(j+b)} + b \right), \qquad (1)$$

where $b$ is the additive bias term and $\sigma(\cdot)$ stands for the neuron's activation unit. Specifically, the activation function $\sigma$ is a standard way to model a neuron's output as a function of its input. Convenient choices for the activation function include the logistic sigmoid, the hyperbolic tangent, and the Rectified Linear Unit (ReLU). Taking into consideration the training time required for the gradient descent process, the saturating non-linearities (i.e., tanh and the logistic sigmoid) are much slower than the non-saturating ReLU function.

The output of the convolutional layer is directly utilized as input to a sub-sampling layer that produces downsampled versions of the input maps. There are several types of pooling, two common types of which are max-pooling and average-pooling, which partition the input image into a set of non-overlapping or overlapping patches and output the maximum or average value of each such sub-region.

For the 2D feature extraction networks, we consider the VGG-16 CNN architecture, which is composed of 13 convolutional layers, five of them followed by a max-pooling layer, and three fully connected layers [9]. Unlike image detection problems, feature extraction in video must address the challenges associated with the variation of the duration of events, in addition to the challenges related to illumination and viewpoint variability.
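As a concrete illustration of Eq. (1) and the pooling step described above, the following is a minimal single-channel NumPy sketch of a "valid" convolution followed by a ReLU activation and non-overlapping max-pooling. The shapes, the scalar bias and the function names are illustrative; this is not the VGG-16 implementation itself.

```python
import numpy as np

def conv2d_valid(x, w, b):
    """'Valid' 2D convolution of an n x n input with an m x m kernel,
    producing an (n-m+1) x (n-m+1) feature map as in Eq. (1)."""
    n, m = x.shape[0], w.shape[0]
    out = np.empty((n - m + 1, n - m + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(w * x[i:i + m, j:j + m]) + b
    return out

def relu(h):
    """Non-saturating Rectified Linear Unit activation, applied element-wise."""
    return np.maximum(h, 0.0)

def max_pool(h, p=2):
    """Non-overlapping p x p max pooling (sub-sampling) of a feature map."""
    H, W = h.shape[0] // p * p, h.shape[1] // p * p
    return h[:H, :W].reshape(H // p, p, W // p, p).max(axis=(1, 3))

# Example: a random 8x8 input region, a 3x3 learnable kernel and a scalar bias.
x = np.random.rand(8, 8)
w = np.random.randn(3, 3) * 0.1
feature_map = max_pool(relu(conv2d_valid(x, w, b=0.0)))   # shape (3, 3)
```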
We fuse the two representations using a sparsity regularized Autoencoder; more specifically, we consider all available training data from all classes as input to the unsupervised Autoencoder for extracting features encoding both spatial and temporal information. Formally, the Autoencoder is a deterministic feed-forward artificial neural network comprised of an input and an output layer of the same size, with a hidden layer in between, which is trained with backpropagation in a fully unsupervised manner, aiming to learn an approximation $\hat{z}$ of the input which would ideally be more descriptive than the raw input.

The feature mapping that transforms an input pattern $z \in \mathbb{R}^{n}$ into a hidden representation $h'$ (called the code) of $k$ neurons (units) is defined by the encoder function:

$$f(z) = h' = \alpha_f(W_1 z + b_1), \qquad (2)$$

where $\alpha_f : \mathbb{R} \mapsto \mathbb{R}$ is the activation function applied component-wise to the input vector. The activation function is usually chosen to be nonlinear; examples include the logistic sigmoid and the hyperbolic tangent. The encoder is parametrized by a weight matrix $W_1 \in \mathbb{R}^{k \times n}$, which models the connections between the input and the hidden layer, and a bias vector $b_1 \in \mathbb{R}^{k \times 1}$. The network output is then computed by mapping the resulting hidden representation $h'$ back into a reconstructed vector $\hat{z} \in \mathbb{R}^{n \times 1}$ using a separate decoder function of the form:

$$g(f(z)) = \hat{z} = \alpha_g(W_2 h' + b_2), \qquad (3)$$

where $\alpha_g$ is the activation function, $W_2 \in \mathbb{R}^{n \times k}$ is the decoding matrix, and $b_2 \in \mathbb{R}^{n}$ is a vector of bias parameters learned from the hidden to the output layer. The estimation of the parameter set $\theta = \{W_1, b_1, W_2, b_2\}$ of an Autoencoder is achieved through the minimization of the reconstruction error between the input and the output according to a specific loss function. Given the training set $Z$, a typical loss function seeks to minimize the normalized sum of squares error, defining the following optimization objective:
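The source text breaks off at this point; a plausible form of the objective, assuming the standard normalized sum-of-squares reconstruction loss over a training set $Z = \{z_1, \ldots, z_N\}$ (this expression is a reconstruction, not quoted from the paper), is:

$$J(\theta) = \frac{1}{2N} \sum_{i=1}^{N} \big\lVert g(f(z_i)) - z_i \big\rVert_2^2, \qquad \theta = \{W_1, b_1, W_2, b_2\},$$

to which a sparsity penalty on the average hidden activations (e.g., a KL-divergence term) would typically be added, given that the fusion Autoencoder is sparsity regularized.

The following is a minimal NumPy sketch of the encoder/decoder mappings of Eqs. (2)-(3) and of the reconstruction error above; the sigmoid activation, the dimensions and all variable names are illustrative assumptions rather than the configuration used in the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encode(z, W1, b1):
    """Eq. (2): h' = alpha_f(W1 z + b1), mapping R^n -> R^k."""
    return sigmoid(W1 @ z + b1)

def decode(h, W2, b2):
    """Eq. (3): z_hat = alpha_g(W2 h' + b2), mapping R^k -> R^n."""
    return sigmoid(W2 @ h + b2)

# One fused spatio-temporal feature vector z in R^n, compressed to a k-dim code.
n, k = 8192, 512                      # illustrative sizes (e.g., two 4096-d streams)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.01, size=(k, n)), np.zeros(k)
W2, b2 = rng.normal(scale=0.01, size=(n, k)), np.zeros(n)
z = rng.random(n)

z_hat = decode(encode(z, W1, b1), W2, b2)
reconstruction_error = 0.5 * np.sum((z_hat - z) ** 2)
```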